R Technology Workshop

R is a widely used free software environment for statistical computing and graphics. ggplot2 is a data visualization package for R that can be used to produce publication-quality graphics. This workshop is designed to introduce you to R and ggplot2 as well as RStudio, KnitR, Slidify, and Shiny.
R is a central piece of the Big Data analytics revolution. For example, see the article “Big data influencer on how R is paving the way” at http://opensource.com/business/14/7/interview-david-smith-revolution-analytics

This is how my RStudio is configured:

sessionInfo()
## R version 3.2.1 (2015-06-18)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
## Running under: OS X 10.8.5 (Mountain Lion)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] magrittr_1.5    tools_3.2.1     htmltools_0.2.6 yaml_2.1.13    
##  [5] stringi_0.5-5   rmarkdown_0.8   knitr_1.11      stringr_1.0.0  
##  [9] digest_0.6.8    evaluate_0.7.2

You also need to install LaTeX if you want to generate PDF files from KnitR.

http://latex-project.org/ftp.html

Getting Started - Clone the RWorkshop Git Repository:

Use a GUI tool like SourceTree to clone the repository or execute the following commands in a terminal window:

Phils-MacBook-Pro:Mine pcannata$ pwd
/Users/pcannata
Phils-MacBook-Pro:~ pcannata$ git clone https://github.com/pcannata/DataVisualization.git
Cloning into 'DataVisualization'...
remote: Counting objects: 74, done.
remote: Compressing objects: 100% (60/60), done.
remote: Total 74 (delta 6), reused 67 (delta 4)
Unpacking objects: 100% (74/74), done.
Checking connectivity... done.
Phils-MacBook-Pro:~ pcannata$ ls -a DataVisualization/
. .. .git README.md RWorkshop

Getting Started - Create a New RStudio Project for the code in the cloned repository:

Getting Started - Create a .Rprofile file to load libraries when the project is started:

Create a new text file named .Rprofile.

Put the following into .Rprofile
require("ggplot2")
require("ggthemes")
require("gplots")
require("grid")
require("RCurl")
require("reshape2")
require("rstudio")
require("tableplot")
require("tidyr")
require("dplyr")
require("jsonlite")
require("extrafont")
require("lubridate")

Be sure to put a newline after the last require statement.
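
A plain require() call only warns when a package is missing, so a long .Rprofile can fail quietly. The following is a minimal sketch, not part of the workshop repository, of a startup helper that attaches whatever is installed and reports the rest. The function name load_if_installed and the made-up package name in the demo call are assumptions chosen for illustration.

```r
# Hypothetical helper (not from the workshop repository): attach each
# package that is installed and report the ones that are not.
load_if_installed <- function(packages) {
  loaded <- vapply(packages, function(pkg) {
    suppressWarnings(suppressPackageStartupMessages(
      require(pkg, character.only = TRUE, quietly = TRUE)
    ))
  }, logical(1))
  not_loaded <- names(loaded)[!loaded]
  if (length(not_loaded) > 0) {
    message("Not installed: ", paste(not_loaded, collapse = ", "))
  }
  invisible(loaded)
}

# Demo call: "stats" ships with R, the second name is deliberately bogus.
status <- load_if_installed(c("stats", "aPackageThatDoesNotExist"))
print(status)
```

In a real .Rprofile you would pass the package names listed above instead of the two demo names.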

High Level Overview - Creating an Excel-like Chart in R - see the 00 Overview Folder in the DrCannata/Rworkshop Repository

This is something that is easily done in Excel:

How would you do the same thing in R?

source("../00 Overview/Overview.R", echo = TRUE)
## 
## > x <- c(1, 2, 3, 4, 5)
## 
## > y <- 3 * x
## 
## > y1 <- 2^x
## 
## > x
## [1] 1 2 3 4 5
## 
## > y
## [1]  3  6  9 12 15
## 
## > y1
## [1]  2  4  8 16 32
## 
## > df <- data.frame(x, y, y1)
## 
## > df
##   x  y y1
## 1 1  3  2
## 2 2  6  4
## 3 3  9  8
## 4 4 12 16
## 5 5 15 32
## 
## > require(reshape2)
## Loading required package: reshape2
## 
## > mdf <- melt(df, id.vars = "x", measure.vars = c("y", 
## +     "y1"))
## 
## > mdf
##    x variable value
## 1  1        y     3
## 2  2        y     6
## 3  3        y     9
## 4  4        y    12
## 5  5        y    15
## 6  1       y1     2
## 7  2       y1     4
## 8  3       y1     8
## 9  4       y1    16
## 10 5       y1    32
## 
## > require(ggplot2)
## Loading required package: ggplot2
## 
## > ggplot(mdf, aes(x = x, y = value, color = variable)) + 
## +     geom_line()

See also http://cran.r-project.org/doc/manuals/r-devel/R-lang.html, http://www.r-tutor.com/r-introduction, and http://www.cookbook-r.com/
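
The key step in the overview script is melt(), which turns the wide data frame (one column per series) into a long one (one row per point) so that ggplot2 can map the series name to color. As a rough sketch of what that reshaping does, the same long table can be built with base R's stack(). This is an illustration of the idea under that assumption, not how reshape2 is implemented, and the variable name long is made up here.

```r
# Rebuild the wide data frame from the overview script.
x <- c(1, 2, 3, 4, 5)
df <- data.frame(x, y = 3 * x, y1 = 2^x)

# stack() puts the values of y and y1 into one column ("values") and
# records which column each value came from in a factor ("ind").
long <- data.frame(x = rep(df$x, times = 2), stack(df[, c("y", "y1")]))
names(long) <- c("x", "value", "variable")
long <- long[, c("x", "variable", "value")]

print(long)
```

The result has the same columns as mdf (x, variable, value), with one row per point of each series.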

R Dataframes - see the 01 R Dataframes Folder in the DrCannata/Rworkshop Repository

A data frame is used for storing data tables. It is a list of vectors of equal length. For example, the following variable df is a data frame containing three vectors n, s, b.

n = c(2, 3, 5) 
s = c("aa", "bb", "cc") 
b = c(TRUE, FALSE, TRUE) 
df = data.frame(n, s, b)       # df is a data frame
head(df)
##   n  s     b
## 1 2 aa  TRUE
## 2 3 bb FALSE
## 3 5 cc  TRUE
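
Because a data frame is a list of equal-length vectors, the usual list and matrix accessors both work on it. The sketch below rebuilds the same three-column data frame and shows a few common ways to inspect it. Passing stringsAsFactors = FALSE is a choice made here so that s stays a character vector on any R version.

```r
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc")
b <- c(TRUE, FALSE, TRUE)
df <- data.frame(n, s, b, stringsAsFactors = FALSE)

is.list(df)          # a data frame is a list of columns
dim(df)              # number of rows, then number of columns
sapply(df, class)    # type of each column
df$n                 # one column as a vector, by name
df[["s"]]            # the same idea with list-style indexing
df[2, ]              # second row, all columns
df[df$b, "n"]        # values of n in the rows where b is TRUE
```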

Dataframes can be loaded from databases, CSV files, Excel, etc. Loading dataframes from an Oracle database is discussed later in this workshop.
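
As a small illustration of the CSV case, the sketch below writes a data frame to a temporary file and reads it back with base R's read.csv(). The temporary file and the variable names are placeholders for this example only.

```r
# Write a small data frame to a temporary CSV file.
csv_path <- tempfile(fileext = ".csv")
original <- data.frame(n = c(2, 3, 5), s = c("aa", "bb", "cc"),
                       stringsAsFactors = FALSE)
write.csv(original, csv_path, row.names = FALSE)

# Read it back; the header row becomes the column names.
loaded <- read.csv(csv_path, stringsAsFactors = FALSE)
print(loaded)

unlink(csv_path)  # clean up the temporary file
```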

See also http://www.r-tutor.com/r-introduction/data-frame

Many R packages come with demo dataframes. The ggplot2 package comes with a demo dataframe called diamonds, which we will use for this workshop.

source("../01 R Dataframes/Dataframes.R", echo = TRUE)
## 
## > require("ggplot2")
## 
## > "Displaying the top few rows of a dataframe:"
## [1] "Displaying the top few rows of a dataframe:"
## 
## > head(diamonds)
##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
## 
## > "Summary of each variable in the dataframe."
## [1] "Summary of each variable in the dataframe."
## 
## > names(diamonds)
##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
##  [8] "x"       "y"       "z"      
## 
## > `?`(diamonds)
## 
## > summary(diamonds)
##      carat               cut        color        clarity     
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655  
##                                     J: 2808   (Other): 2531  
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##                                                                  
##        y                z         
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.710   Median : 3.530  
##  Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :58.900   Max.   :31.800  
##                                   
## 
## > "Selecting a subset of columns from a dataframe:"
## [1] "Selecting a subset of columns from a dataframe:"
## 
## > head(subset(diamonds, select = c(carat, cut)))
##   carat       cut
## 1  0.23     Ideal
## 2  0.21   Premium
## 3  0.23      Good
## 4  0.29   Premium
## 5  0.31      Good
## 6  0.24 Very Good
## 
## > "Selecting a subset of rows from a dataframe:"
## [1] "Selecting a subset of rows from a dataframe:"
## 
## > head(subset(diamonds, cut == "Ideal" & price > 5000))
##       carat   cut color clarity depth table price    x    y    z
## 11417  1.16 Ideal     E     SI2  62.7  56.0  5001 6.69 6.73 4.21
## 11418  1.16 Ideal     E     SI2  59.9  57.0  5001 6.80 6.82 4.08
## 11422  1.07 Ideal     I     SI1  61.7  56.1  5002 6.57 6.59 4.06
## 11423  1.10 Ideal     H     SI2  62.0  56.5  5002 6.58 6.63 4.09
## 11424  1.20 Ideal     J     SI1  62.1  55.0  5002 6.81 6.84 4.24
## 11431  1.14 Ideal     H     SI1  61.6  57.0  5003 6.70 6.75 4.14
## 
## > "Find average price group by color (plyr package is needed)"
## [1] "Find average price group by color (plyr package is needed)"
## 
## > require("plyr")
## Loading required package: plyr
## 
## > ddply(subset(diamonds, cut == "Ideal" & price > 5000), 
## +     ~color, summarise, o = mean(price, na.rm = TRUE))
##   color        o
## 1     D 9056.612
## 2     E 9065.486
## 3     F 9704.489
## 4     G 9392.281
## 5     H 8923.306
## 6     I 9663.031
## 7     J 9406.772

For more on subsetting dataframes see http://www.ats.ucla.edu/stat/r/faq/subset_R.htm
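
The group-by average computed with ddply() in the script can also be sketched in base R with aggregate(), which avoids the plyr dependency. The tiny data frame below is made-up illustration data, not the diamonds dataset.

```r
# Made-up data standing in for diamonds (illustration only).
toy <- data.frame(
  color = c("D", "D", "E", "E", "E"),
  cut   = c("Ideal", "Ideal", "Ideal", "Good", "Ideal"),
  price = c(6000, 8000, 5500, 7000, 4000),
  stringsAsFactors = FALSE
)

# Same filter as in the script: Ideal cut and price above 5000.
ideal_expensive <- subset(toy, cut == "Ideal" & price > 5000)

# Mean price per color; the formula reads "price grouped by color".
avg <- aggregate(price ~ color, data = ideal_expensive, FUN = mean)
print(avg)
```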

Connecting to Oracle with RestfulReL

source("../02 RESTful Data Access/Access Oracle Database.R", echo = TRUE)
## 
## > require("jsonlite")
## Loading required package: jsonlite
## 
## Attaching package: 'jsonlite'
## 
## The following object is masked from 'package:utils':
## 
##     View
## 
## > require("RCurl")
## Loading required package: RCurl
## Loading required package: bitops
## 
## > df <- data.frame(fromJSON(getURL(URLencode("129.152.144.84:5001/rest/native/?query=\"select * from emp order by job\""), 
## +     httpheader = c(DB =  .... [TRUNCATED] 
## 
## > df
##    EMPNO  ENAME       JOB  MGR            HIREDATE  SAL COMM DEPTNO
## 1   7788  SCOTT   ANALYST 7566 1982-12-09 00:00:00 3000 null     20
## 2   7902   FORD   ANALYST 7566 1981-12-03 00:00:00 3000 null     20
## 3   7934 MILLER     CLERK 7782 1982-01-23 00:00:00 1300 null     50
## 4   7900  JAMES     CLERK 7698 1981-12-03 00:00:00  950 null     30
## 5   7369  SMITH     CLERK 7902 1980-12-17 00:00:00  800 null     20
## 6   7876  ADAMS     CLERK 7788 1983-01-12 00:00:00 1100 null     20
## 7   7698  BLAKE   MANAGER 7839 1981-05-01 00:00:00 2850 null     30
## 8   7566  JONES   MANAGER 7839 1981-04-02 00:00:00 2975 null     20
## 9   7782  CLARK   MANAGER 7839 1981-06-09 00:00:00 2450 null     10
## 10  7839   KING PRESIDENT null 1981-11-17 00:00:00 5000 null     10
## 11  7844 TURNER  SALESMAN 7698 1981-09-08 00:00:00 1500 null     30
## 12  7654 MARTIN  SALESMAN 7698 1981-09-28 00:00:00 1250 1400     30
## 13  7521   WARD  SALESMAN 7698 1981-02-22 00:00:00 1250  500     30
## 14  7499  ALLEN  SALESMAN 7698 1981-02-20 00:00:00 1600  300     30
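
In the output above, missing values are printed as the text null rather than as NA. Assuming the service delivers those columns as character strings (an assumption based only on how the output prints), a cleanup step along the following lines may be useful before doing arithmetic. The helper name null_to_na is hypothetical, and the three rows are copied from the output above.

```r
# Three rows from the emp output, with every column held as text (assumed).
emp <- data.frame(
  ENAME = c("KING", "MARTIN", "WARD"),
  MGR   = c("null", "7698", "7698"),
  COMM  = c("null", "1400", "500"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace the literal string "null" with NA.
null_to_na <- function(column) {
  column[!is.na(column) & column == "null"] <- NA
  column
}

emp[] <- lapply(emp, null_to_na)   # apply to every column, keep the data frame
emp$MGR  <- as.integer(emp$MGR)    # convert the numeric columns
emp$COMM <- as.numeric(emp$COMM)

mean(emp$COMM, na.rm = TRUE)       # arithmetic now skips the missing value
```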

ggplot2

ggplot2 is an R package for data exploration and visualization. It produces publication-quality graphics and lets you slice and dice your data in many different ways. ggplot2 uses a general scheme for data visualization that breaks graphs up into semantic components such as scales and layers. In contrast to other graphics packages, ggplot2 allows the user to add, remove, or alter components in a plot at a high level of abstraction.
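
To make the idea of components concrete, here is a small sketch that builds one plot in stages with the + operator: first the data and axis mappings, then a layer, then a scale, then facets. The variable names and the alpha value are choices made for this illustration.

```r
library(ggplot2)

# Data and axis mappings only; no layers yet, so nothing is drawn.
base_plot <- ggplot(diamonds, aes(x = carat, y = price))

# Add a layer: points colored by the diamond's color grade.
with_points <- base_plot + geom_point(aes(color = color), alpha = 0.3)

# Add a scale: put price on a log10 axis.
with_scale <- with_points + scale_y_log10()

# Add facets: one panel per cut.
with_facets <- with_scale + facet_wrap(~cut)

# Printing a plot object draws it, for example:
# print(with_facets)
```

Each intermediate object is a complete plot specification, so any stage can be printed, saved, or extended independently.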

See also http://ggplot2.org/, http://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf, and https://groups.google.com/forum/#!forum/ggplot2

source("../03 ggplot/Plots.R", echo = TRUE)
## 
## > options(java.parameters = "-Xmx2g")
## 
## > head(diamonds)
##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
## 
## > ggplot(data = diamonds) + geom_histogram(aes(x = carat))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

## 
## > ggplot(data = diamonds) + geom_density(aes(x = carat, 
## +     fill = "gray50"))

## 
## > ggplot(diamonds, aes(x = carat, y = price)) + geom_point()

## 
## > p <- ggplot(diamonds, aes(x = carat, y = price)) + 
## +     geom_point(aes(color = color))
## 
## > p + facet_wrap(~color)

## 
## > p + facet_grid(cut ~ clarity)

## 
## > p <- ggplot(diamonds, aes(x = carat)) + geom_histogram(aes(color = color), 
## +     binwidth = max(diamonds$carat)/30)
## 
## > p + facet_wrap(~color)

## 
## > p + facet_grid(cut ~ clarity)

Chapter 7 of “R for Everyone” has many more ggplot2 examples.

ggplot2 and functions

source("../03 ggplot/plotFunction.R", echo = TRUE)
## 
## > FigureNum <<- 0
## 
## > ggplot_func <- function(df, Title = "Diamonds", Legend = "color", 
## +     PointColor = c("red", "blue", "green", "yellow", "grey", 
## +         "black" .... [TRUNCATED] 
## 
## > p1 <- ggplot_func(diamonds)
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.
## 
## > p1
## 
## > p2 <- ggplot_func(diamonds, YMin = 5000, YMax = 15000)
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.

## 
## > p2
## Warning: Removed 40868 rows containing missing values (geom_point).
## 
## > p3 <- ggplot_func(subset(diamonds, cut == "Premium"), 
## +     Legend = "cut")
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.

## 
## > p3
## 
## > p4 <- ggplot_func(diamonds, Legend = "clarity", PointColor = c("red", 
## +     "blue", "green", "yellow", "grey", "black", "purple", "orange"))
## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.

## 
## > p4
## 
## > library("grid")
## 
## > png("4diamonds.png", width = 25, height = 20, units = "in", 
## +     res = 72)
## 
## > grid.newpage()
## 
## > pushViewport(viewport(layout = grid.layout(2, 2)))
## 
## > print(p1, vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
## 
## > print(p2, vp = viewport(layout.pos.row = 1, layout.pos.col = 2))
## Warning: Removed 40868 rows containing missing values (geom_point).

## 
## > print(p3, vp = viewport(layout.pos.row = 2, layout.pos.col = 1))
## 
## > print(p4, vp = viewport(layout.pos.row = 2, layout.pos.col = 2))
## 
## > dev.off()
## quartz_off_screen 
##                 2

You should now be able to open RWorkshop/00 Doc/4diamonds.png. It should look like the following plot.

KnitR

KnitR is an R package designed to generate dynamic reports using a mix of R, LaTeX, and R Markdown (see http://rmarkdown.rstudio.com/?version=0.98.945&mode=desktop).

See also http://yihui.name/knitr/ and http://kbroman.github.io/knitr_knutshell/

Simple examples can be found in “04 KnitR/doc1.Rmd” and “04 KnitR/doc2.Rmd”. These can generate HTML, PDF, and Word documents. The output from knitting doc2.Rmd is:

A comprehensive KnitR example (which generated this document) can be found in “00 Doc/RWorkshop.Rmd”.
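
To see what knitting does without opening a file, the sketch below passes a tiny R Markdown document to knitr::knit() as text. The document content is made up for this example and is not taken from doc1.Rmd or doc2.Rmd. The fence string is assembled with paste() only so that three literal backticks do not appear inside this listing.

```r
library(knitr)

# Three backticks, built indirectly to keep this listing readable.
fence <- paste(rep("`", 3), collapse = "")

# A made-up R Markdown document: one sentence and one code chunk.
rmd_lines <- c(
  "A tiny report.",
  "",
  paste0(fence, "{r}"),
  "1 + 1",
  fence
)

# knit() runs the chunk and returns Markdown with the result woven in.
md <- knit(text = rmd_lines, quiet = TRUE)
cat(md)
```

The returned Markdown contains the original sentence, the echoed code, and the evaluated result; converting that Markdown to HTML, PDF, or Word is a separate step that RStudio's Knit button handles for you.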

slidify

You can use Slidify to generate HTML slide decks using only the Rmarkdown language.

See also http://slidify.org and http://slidify.org/start.html

Follow the instructions in “05 Slidify/slidify setup.R” to install and run slidify. You should be able to produce a slide deck with a first slide that looks something like the following.

Cool trick - Any GitHub repo with a branch called gh-pages gets served as a website. If the repo contains website content (HTML, CSS), you get free web hosting. So, create a branch called gh-pages and push to it.

shiny

The shiny R package allows you to build interactive web-based applications using only R with no knowledge of html, css, or javascript needed. You just need to write two scripts (see the example files in the 06Shiny directory):

  • ui.R : Defines the layout and the interactive elements that the user can access.
  • server.R : Defines what computations are done in response to user interactions.
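
The split between layout and computation can be sketched in a single file. The example below is not the app in the 06Shiny directory. It is a made-up minimal app that uses R's built-in faithful dataset, with input and output IDs chosen for illustration.

```r
library(shiny)

# What ui.R would contain: layout plus the interactive elements.
ui <- fluidPage(
  titlePanel("Hypothetical histogram demo"),
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 30),
  plotOutput("hist")
)

# What server.R would contain: computations run in response to input.
server <- function(input, output) {
  output$hist <- renderPlot({
    # Re-runs automatically whenever the slider value changes.
    hist(faithful$waiting, breaks = input$bins,
         main = "Waiting time between eruptions", xlab = "Minutes")
  })
}

# Bundle both halves into one app object.
app <- shinyApp(ui = ui, server = server)

# Uncomment to launch the app in a browser:
# runApp(app)
```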

See also http://shiny.rstudio.com and http://shiny.rstudio.com/tutorial

To run the shiny app in the 06Shiny directory, set the working directory to the main RWorkshop directory and run the following:
library(shiny)
runApp("06Shiny") # Make sure there are no spaces in the string argument to runApp

This should open the application in a browser. You can also open it yourself at the address that runApp prints in the console, for example http://127.0.0.1:6837. It should look like the following.

shinyapps

The example above ran the shiny app on your local machine. To share it that way, you have to send the R files around, and each user needs to have R installed and know a little about it.

Instead, you can remotely host shiny apps and then just send people links. Get a free account at shinyapps.io/signup.html and give it a try.

library("devtools", lib.loc="/Library/Frameworks/R.framework/Versions/3.0/Resources/library")
install_github( repo = "shinyapps", username="rstudio" )
shinyapps::setAccountInfo(name='pcannata', token='3ECF447A741004F6A8B7208C9ED778E1', secret='. . .')

# library(shinyapps)
getwd()
## [1] "/Users/pcannata/Mine/UT/GitRepositories/DataVisualization/RWorkshop/00 Doc"
# Uncomment the following line to deploy the app.
#deployApp("../06Shiny")

Now you can try the app at https://pcannata.shinyapps.io/06Shiny/

See also https://www.shinyapps.io/ and http://shiny.rstudio.com/articles/shinyapps.html

Data Wrangling

See also http://cran.r-project.org/doc/manuals/r-devel/R-lang.html, http://www.r-tutor.com/r-introduction, and http://www.cookbook-r.com/
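
The script in this section chains dplyr verbs with the %>% pipe and uses window functions such as cume_dist(). Several of its commands are truncated in the echoed output, so the sketch below does not reproduce them. It shows the same pattern (select, filter, mutate) on a made-up four-row data frame, and then spells out what cume_dist() computes with a hypothetical helper, manual_cume_dist, for comparison.

```r
library(dplyr)

# Made-up data standing in for diamonds (illustration only).
toy <- data.frame(
  cut     = c("Good", "Fair", "Ideal", "Good"),
  clarity = c("VS1", "VS1", "VS1", "SI1"),
  price   = c(400, 300, 500, 300),
  x = c(4.0, 5.5, 4.5, 4.2),
  y = c(4.1, 5.4, 4.4, 4.3),
  z = c(2.5, 3.5, 2.75, 2.6),
  stringsAsFactors = FALSE
)

# Each verb takes a data frame and returns a data frame, so they chain.
result <- toy %>%
  select(cut, clarity, x, y, z) %>%                        # keep some columns
  filter(cut %in% c("Good", "Fair"), clarity == "VS1") %>% # keep some rows
  mutate(sum = x + y + z)                                  # add a column
print(result)

# cume_dist(v) is the fraction of values less than or equal to each value.
manual_cume_dist <- function(values) {
  vapply(values, function(v) mean(values <= v), numeric(1))
}

manual_cume_dist(toy$price)
cume_dist(toy$price)
```

Both calls give the same fractions, which is why filtering on a cume_dist() value of at most 0.2 keeps roughly the cheapest fifth of the rows.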

source("../07 Data Wrangling/Data Wrangling.R", echo = TRUE)
## 
## > require(tidyr)
## Loading required package: tidyr
## 
## > require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## 
## > tbl_df(diamonds)
## Source: local data frame [53,940 x 10]
## 
##    carat       cut color clarity depth table price    x    y    z
## 1   0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2   0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3   0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4   0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5   0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6   0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
## 7   0.24 Very Good     I    VVS1  62.3    57   336 3.95 3.98 2.47
## 8   0.26 Very Good     H     SI1  61.9    55   337 4.07 4.11 2.53
## 9   0.22      Fair     E     VS2  65.1    61   337 3.87 3.78 2.49
## 10  0.23 Very Good     H     VS1  59.4    61   338 4.00 4.05 2.39
## ..   ...       ...   ...     ...   ...   ...   ...  ...  ...  ...
## 
## > View(diamonds)
## 
## > select(diamonds, cut, clarity) %>% tbl_df
## Source: local data frame [53,940 x 2]
## 
##          cut clarity
## 1      Ideal     SI2
## 2    Premium     SI1
## 3       Good     VS1
## 4    Premium     VS2
## 5       Good     SI2
## 6  Very Good    VVS2
## 7  Very Good    VVS1
## 8  Very Good     SI1
## 9       Fair     VS2
## 10 Very Good     VS1
## ..       ...     ...
## 
## > diamonds %>% select(cut, clarity) %>% tbl_df
## Source: local data frame [53,940 x 2]
## 
##          cut clarity
## 1      Ideal     SI2
## 2    Premium     SI1
## 3       Good     VS1
## 4    Premium     VS2
## 5       Good     SI2
## 6  Very Good    VVS2
## 7  Very Good    VVS1
## 8  Very Good     SI1
## 9       Fair     VS2
## 10 Very Good     VS1
## ..       ...     ...
## 
## > x <- diamonds %>% select(cut, clarity) %>% tbl_df
## 
## > diamonds %>% select(cut, clarity) %>% filter(cut == 
## +     "Good") %>% tbl_df
## Source: local data frame [4,906 x 2]
## 
##     cut clarity
## 1  Good     VS1
## 2  Good     SI2
## 3  Good     SI1
## 4  Good     SI1
## 5  Good     SI1
## 6  Good     SI2
## 7  Good     VS1
## 8  Good     VS1
## 9  Good     SI1
## 10 Good     VS2
## ..  ...     ...
## 
## > diamonds %>% select(cut, clarity) %>% filter(cut %in% 
## +     c("Good", "Fair")) %>% tbl_df
## Source: local data frame [6,516 x 2]
## 
##     cut clarity
## 1  Good     VS1
## 2  Good     SI2
## 3  Fair     VS2
## 4  Good     SI1
## 5  Good     SI1
## 6  Good     SI1
## 7  Good     SI2
## 8  Good     VS1
## 9  Good     VS1
## 10 Good     SI1
## ..  ...     ...
## 
## > diamonds %>% select(cut, clarity) %>% filter(cut %in% 
## +     c("Good", "Fair"), clarity == "VS1") %>% tbl_df
## Source: local data frame [818 x 2]
## 
##     cut clarity
## 1  Good     VS1
## 2  Good     VS1
## 3  Good     VS1
## 4  Good     VS1
## 5  Good     VS1
## 6  Good     VS1
## 7  Good     VS1
## 8  Good     VS1
## 9  Fair     VS1
## 10 Good     VS1
## ..  ...     ...
## 
## > diamonds %>% select(cut, clarity) %>% filter(cut %in% 
## +     c("Good", "Fair"), clarity == "VS1" | is.na(cut)) %>% tbl_df
## Source: local data frame [818 x 2]
## 
##     cut clarity
## 1  Good     VS1
## 2  Good     VS1
## 3  Good     VS1
## 4  Good     VS1
## 5  Good     VS1
## 6  Good     VS1
## 7  Good     VS1
## 8  Good     VS1
## 9  Fair     VS1
## 10 Good     VS1
## ..  ...     ...
## 
## > diamonds %>% select(cut, clarity, x, y, z) %>% filter(cut %in% 
## +     c("Good", "Fair"), clarity == "VS1" | is.na(cut)) %>% tbl_df
## Source: local data frame [818 x 5]
## 
##     cut clarity    x    y    z
## 1  Good     VS1 4.05 4.07 2.31
## 2  Good     VS1 4.06 4.08 2.37
## 3  Good     VS1 3.83 3.85 2.46
## 4  Good     VS1 4.19 4.24 2.46
## 5  Good     VS1 5.71 5.76 3.40
## 6  Good     VS1 5.81 5.77 3.31
## 7  Good     VS1 5.97 5.92 3.53
## 8  Good     VS1 5.74 5.72 3.48
## 9  Fair     VS1 5.89 5.80 3.46
## 10 Good     VS1 5.56 5.59 3.63
## ..  ...     ...  ...  ...  ...
## 
## > diamonds %>% select(cut, clarity, x, y, z) %>% filter(cut %in% 
## +     c("Good", "Fair"), clarity == "VS1" | is.na(cut)) %>% mutate(sum = x + 
## +      .... [TRUNCATED] 
## Source: local data frame [818 x 6]
## 
##     cut clarity    x    y    z   sum
## 1  Good     VS1 4.05 4.07 2.31 10.43
## 2  Good     VS1 4.06 4.08 2.37 10.51
## 3  Good     VS1 3.83 3.85 2.46 10.14
## 4  Good     VS1 4.19 4.24 2.46 10.89
## 5  Good     VS1 5.71 5.76 3.40 14.87
## 6  Good     VS1 5.81 5.77 3.31 14.89
## 7  Good     VS1 5.97 5.92 3.53 15.42
## 8  Good     VS1 5.74 5.72 3.48 14.94
## 9  Fair     VS1 5.89 5.80 3.46 15.15
## 10 Good     VS1 5.56 5.59 3.63 14.78
## ..  ...     ...  ...  ...  ...   ...
## 
## > ndf <- diamonds %>% select(cut, clarity, x, y, z) %>% 
## +     filter(cut %in% c("Good", "Fair"), clarity == "VS1" | is.na(cut)) %>% 
## +     mutate(sum .... [TRUNCATED] 
## 
## > ndf
## Source: local data frame [818 x 6]
## 
##     cut clarity    x    y    z   sum
## 1  Good     VS1 4.05 4.07 2.31 10.43
## 2  Good     VS1 4.06 4.08 2.37 10.51
## 3  Good     VS1 3.83 3.85 2.46 10.14
## 4  Good     VS1 4.19 4.24 2.46 10.89
## 5  Good     VS1 5.71 5.76 3.40 14.87
## 6  Good     VS1 5.81 5.77 3.31 14.89
## 7  Good     VS1 5.97 5.92 3.53 15.42
## 8  Good     VS1 5.74 5.72 3.48 14.94
## 9  Fair     VS1 5.89 5.80 3.46 15.15
## 10 Good     VS1 5.56 5.59 3.63 14.78
## ..  ...     ...  ...  ...  ...   ...
## 
## > pmin(c(1:5), (5:1))
## [1] 1 2 3 2 1
## 
## > pmax(c(1:5), (5:1))
## [1] 5 4 3 4 5
## 
## > c(1, 1, 2, 0, 4, 3, 5) %>% cummin()
## [1] 1 1 1 0 0 0 0
## 
## > c(1, 1, 2, 5, 4, 3, 5) %>% cummax()
## [1] 1 1 2 5 5 5 5
## 
## > c(1, 1, 2, 3, 4, 3, 5) %>% cumsum()
## [1]  1  2  4  7 11 14 19
## 
## > c(1, 1, 2, 3, 4, 3, 5) %>% cumprod()
## [1]   1   1   2   6  24  72 360
## 
## > c(1, 1, 2, 3, 4, 3, 5) %>% between(2, 4)
## [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
## 
## > c(1, 1, 2, 5, 4, 3, 5) %>% cume_dist()
## [1] 0.2857143 0.2857143 0.4285714 1.0000000 0.7142857 0.5714286 1.0000000
## 
## > c(1:5) %>% cume_dist()
## [1] 0.2 0.4 0.6 0.8 1.0
## 
## > c(1, 1:5) %>% cume_dist()
## [1] 0.3333333 0.3333333 0.5000000 0.6666667 0.8333333 1.0000000
## 
## > c(1:5) %>% cummean()
## [1] 1.0 1.5 2.0 2.5 3.0
## 
## > c(1:5) %>% lead() - c(1:5)
## [1]  1  1  1  1 NA
## 
## > c(1:5) %>% lag() - c(1:5)
## [1] NA -1 -1 -1 -1
## 
## > c(1:10)
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## > c(1:10) %>% ntile(4)
##  [1] 1 1 1 2 2 3 3 3 4 4
## 
## > diamonds %>% mutate(price_percent = cume_dist(price)) %>% 
## +     filter(price_percent <= 0.2) %>% arrange(desc(price_percent)) %>% 
## +     tbl_df
## Source: local data frame [10,783 x 11]
## 
##    carat       cut color clarity depth table price    x    y    z
## 1   0.38     Ideal     I    VVS2  62.0    55   836 4.65 4.67 2.89
## 2   0.41      Good     G     SI1  64.3    54   836 4.72 4.68 3.02
## 3   0.41     Ideal     G     SI2  61.0    57   836 4.82 4.79 2.93
## 4   0.41   Premium     H     SI2  62.2    56   836 4.80 4.72 2.96
## 5   0.30     Ideal     D     VS1  62.2    56   835 4.31 4.27 2.67
## 6   0.30     Ideal     D     VS1  61.0    57   835 4.32 4.31 2.63
## 7   0.35 Very Good     D    VVS2  60.0    58   835 4.57 4.59 2.75
## 8   0.41 Very Good     G     VS2  62.7    60   835 4.71 4.75 2.96
## 9   0.38 Very Good     F     VS1  60.4    60   835 4.70 4.74 2.85
## 10  0.36     Ideal     E     VS1  61.4    54   835 4.59 4.63 2.83
## ..   ...       ...   ...     ...   ...   ...   ...  ...  ...  ...
## Variables not shown: price_percent (dbl)
## 
## > bottom20_diamonds <- diamonds %>% mutate(price_percent = cume_dist(price)) %>% 
## +     filter(price_percent <= 0.2) %>% arrange(desc(price_percent))  .... [TRUNCATED] 
## 
## > diamonds %>% mutate(price_percent = cume_dist(price)) %>% 
## +     filter(price_percent >= 0.8) %>% arrange(desc(price_percent)) %>% 
## +     tbl_df
## Source: local data frame [10,791 x 11]
## 
##    carat       cut color clarity depth table price    x    y    z
## 1   2.29   Premium     I     VS2  60.8    60 18823 8.50 8.47 5.16
## 2   2.00 Very Good     G     SI1  63.5    56 18818 7.90 7.97 5.04
## 3   1.51     Ideal     G      IF  61.7    55 18806 7.37 7.41 4.56
## 4   2.07     Ideal     G     SI2  62.5    55 18804 8.20 8.13 5.11
## 5   2.00 Very Good     H     SI1  62.8    57 18803 7.95 8.00 5.01
## 6   2.29   Premium     I     SI1  61.8    59 18797 8.52 8.45 5.24
## 7   2.04   Premium     H     SI1  58.1    60 18795 8.37 8.28 4.84
## 8   2.00   Premium     I     VS1  60.8    59 18795 8.13 8.02 4.91
## 9   1.71   Premium     F     VS2  62.3    59 18791 7.57 7.53 4.70
## 10  2.15     Ideal     G     SI2  62.6    54 18791 8.29 8.35 5.21
## ..   ...       ...   ...     ...   ...   ...   ...  ...  ...  ...
## Variables not shown: price_percent (dbl)
## 
## > top20_diamonds <- diamonds %>% mutate(price_percent = cume_dist(price)) %>% 
## +     filter(price_percent >= 0.8) %>% arrange(desc(price_percent)) %>% .... [TRUNCATED] 
## 
## > diamonds %>% mutate(price_percent = cume_dist(price)) %>% 
## +     filter(price_percent <= 0.2 | price_percent >= 0.8) %>% ggplot(aes(x = price, 
## +    .... [TRUNCATED]

## 
## > diamonds %>% mutate(minxy = pmin(x, y)) %>% tbl_df
## Source: local data frame [53,940 x 11]
## 
##    carat       cut color clarity depth table price    x    y    z minxy
## 1   0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43  3.95
## 2   0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31  3.84
## 3   0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31  4.05
## 4   0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63  4.20
## 5   0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75  4.34
## 6   0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48  3.94
## 7   0.24 Very Good     I    VVS1  62.3    57   336 3.95 3.98 2.47  3.95
## 8   0.26 Very Good     H     SI1  61.9    55   337 4.07 4.11 2.53  4.07
## 9   0.22      Fair     E     VS2  65.1    61   337 3.87 3.78 2.49  3.78
## 10  0.23 Very Good     H     VS1  59.4    61   338 4.00 4.05 2.39  4.00
## ..   ...       ...   ...     ...   ...   ...   ...  ...  ...  ...   ...
## 
## > diamonds %>% mutate(cummin_x = cummin(x)) %>% tbl_df
## Source: local data frame [53,940 x 11]
## 
##    carat       cut color clarity depth table price    x    y    z cummin_x
## 1   0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43     3.95
## 2   0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31     3.89
## 3   0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31     3.89
## 4   0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63     3.89
## 5   0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75     3.89
## 6   0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48     3.89
## 7   0.24 Very Good     I    VVS1  62.3    57   336 3.95 3.98 2.47     3.89
## 8   0.26 Very Good     H     SI1  61.9    55   337 4.07 4.11 2.53     3.89
## 9   0.22      Fair     E     VS2  65.1    61   337 3.87 3.78 2.49     3.87
## 10  0.23 Very Good     H     VS1  59.4    61   338 4.00 4.05 2.39     3.87
## ..   ...       ...   ...     ...   ...   ...   ...  ...  ...  ...      ...
## 
## > diamonds %>% mutate(cumsum_x = cumsum(x)) %>% tbl_df
## Source: local data frame [53,940 x 11]
## 
##    carat       cut color clarity depth table price    x    y    z cumsum_x
## 1   0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43     3.95
## 2   0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31     7.84
## 3   0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31    11.89
## 4   0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63    16.09
## 5   0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75    20.43
## 6   0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48    24.37
## 7   0.24 Very Good     I    VVS1  62.3    57   336 3.95 3.98 2.47    28.32
## 8   0.26 Very Good     H     SI1  61.9    55   337 4.07 4.11 2.53    32.39
## 9   0.22      Fair     E     VS2  65.1    61   337 3.87 3.78 2.49    36.26
## 10  0.23 Very Good     H     VS1  59.4    61   338 4.00 4.05 2.39    40.26
## ..   ...       ...   ...     ...   ...   ...   ...  ...  ...  ...      ...
## 
## > diamonds %>% mutate(between_x = between(x, 4, 4.1)) %>% 
## +     tbl_df
## Source: local data frame [53,940 x 11]
## 
##    carat       cut color clarity depth table price    x    y    z
## 1   0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2   0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3   0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4   0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5   0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6   0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
## 7   0.24 Very Good     I    VVS1  62.3    57   336 3.95 3.98 2.47
## 8   0.26 Very Good     H     SI1  61.9    55   337 4.07 4.11 2.53
## 9   0.22      Fair     E     VS2  65.1    61   337 3.87 3.78 2.49
## 10  0.23 Very Good     H     VS1  59.4    61   338 4.00 4.05 2.39
## ..   ...       ...   ...     ...   ...   ...   ...  ...  ...  ...
## Variables not shown: between_x (lgl)
## 
## > diamonds %>% mutate(lead_z = lead(z) - z) %>% tbl_df
## Source: local data frame [53,940 x 11]
## 
##    carat       cut color clarity depth table price    x    y    z lead_z
## 1   0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43  -0.12
## 2   0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31   0.00
## 3   0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31   0.32
## 4   0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63   0.12
## 5   0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75  -0.27
## 6   0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48  -0.01
## 7   0.24 Very Good     I    VVS1  62.3    57   336 3.95 3.98 2.47   0.06
## 8   0.26 Very Good     H     SI1  61.9    55   337 4.07 4.11 2.53  -0.04
## 9   0.22      Fair     E     VS2  65.1    61   337 3.87 3.78 2.49  -0.10
## 10  0.23 Very Good     H     VS1  59.4    61   338 4.00 4.05 2.39   0.34
## ..   ...       ...   ...     ...   ...   ...   ...  ...  ...  ...    ...
## 
## > diamonds %>% mutate(lag_z = lag(z) - z) %>% tbl_df
## Source: local data frame [53,940 x 11]
## 
##    carat       cut color clarity depth table price    x    y    z lag_z
## 1   0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43    NA
## 2   0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31  0.12
## 3   0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31  0.00
## 4   0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63 -0.32
## 5   0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75 -0.12
## 6   0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48  0.27
## 7   0.24 Very Good     I    VVS1  62.3    57   336 3.95 3.98 2.47  0.01
## 8   0.26 Very Good     H     SI1  61.9    55   337 4.07 4.11 2.53 -0.06
## 9   0.22      Fair     E     VS2  65.1    61   337 3.87 3.78 2.49  0.04
## 10  0.23 Very Good     H     VS1  59.4    61   338 4.00 4.05 2.39  0.10
## ..   ...       ...   ...     ...   ...   ...   ...  ...  ...  ...   ...
## 
## > diamonds %>% mutate(ntile_z = ntile(z, 100)) %>% arrange(desc(ntile_z)) %>% 
## +     tbl_df
## Source: local data frame [53,940 x 11]
## 
##    carat  cut color clarity depth table price    x    y    z ntile_z
## 1   2.14 Fair     J      I1  69.4    57  5405 7.74 7.70 5.36     100
## 2   2.15 Fair     J      I1  65.5    57  5430 8.01 7.95 5.23     100
## 3   2.22 Fair     J      I1  66.7    56  5607 8.04 8.02 5.36     100
## 4   2.01 Fair     I      I1  67.4    58  5696 7.71 7.64 5.17     100
## 5   2.27 Fair     J      I1  67.6    55  5733 8.05 8.00 5.43     100
## 6   2.00 Fair     H      I1  69.8    54  5914 7.60 7.56 5.29     100
## 7   2.03 Fair     H      I1  66.6    57  6002 7.81 7.75 5.19     100
## 8   2.49 Fair     J      I1  66.3    58  6289 8.26 8.18 5.45     100
## 9   2.01 Fair     G      I1  70.2    57  6315 7.53 7.50 5.27     100
## 10  2.14 Fair     H      I1  66.4    56  6328 8.00 7.92 5.29     100
## ..   ...  ...   ...     ...   ...   ...   ...  ...  ...  ...     ...
## 
## > diamonds %>% mutate(ntile_z = ntile(z, 100)) %>% group_by(ntile_z) %>% 
## +     summarise(n = n()) %>% tbl_df
## Source: local data frame [100 x 2]
## 
##    ntile_z   n
## 1        1 540
## 2        2 539
## 3        3 540
## 4        4 539
## 5        5 539
## 6        6 540
## 7        7 539
## 8        8 540
## 9        9 539
## 10      10 539
## ..     ... ...
## 
## > diamonds %>% summarise(mean = mean(x), sum = sum(x, 
## +     y, z), n = n())
##       mean      sum     n
## 1 5.731157 809338.2 53940
## 
## > diamonds %>% group_by(cut, color) %>% summarise(mean = mean(x), 
## +     sum = sum(x, y, z), n = n())
## Source: local data frame [35 x 5]
## Groups: cut
## 
##     cut color     mean      sum   n
## 1  Fair     D 6.018344  2578.89 163
## 2  Fair     E 5.909063  3469.63 224
## 3  Fair     F 5.990513  4901.35 312
## 4  Fair     G 6.173822  5102.83 314
## 5  Fair     H 6.579373  5240.73 303
## 6  Fair     I 6.564457  3019.00 175
## 7  Fair     J 6.747311  2111.40 119
## 8  Good     D 5.620076  9770.35 662
## 9  Good     E 5.617889 13758.40 933
## 10 Good     F 5.693443 13587.47 909
## ..  ...   ...      ...      ... ...
## 
## > diamonds %>% group_by(cut, color) %>% summarise(mean = mean(x), 
## +     sum = sum(x, y, z), n = n()) %>% ungroup %>% summarize(sum(n))
## Source: local data frame [1 x 1]
## 
##   sum(n)
## 1  53940
## 
## > data.frame(x = c(1, 1, 1, 2, 2), y = c(5:1), z = (1:5)) %>% 
## +     arrange(desc(x)) %>% tbl_df
## Source: local data frame [5 x 3]
## 
##   x y z
## 1 2 2 4
## 2 2 1 5
## 3 1 5 1
## 4 1 4 2
## 5 1 3 3
## 
## > data.frame(x = c(1, 1, 1, 2, 2), y = c(5:1), z = (1:5)) %>% 
## +     arrange(desc(x), y) %>% tbl_df
## Source: local data frame [5 x 3]
## 
##   x y z
## 1 2 1 5
## 2 2 2 4
## 3 1 3 3
## 4 1 4 2
## 5 1 5 1
## 
## > diamonds %>% group_by(cut, color) %>% summarise(mean = mean(x), 
## +     sum = sum(x, y, z), n = n()) %>% arrange(n)
## Source: local data frame [35 x 5]
## Groups: cut
## 
##     cut color     mean     sum   n
## 1  Fair     J 6.747311 2111.40 119
## 2  Fair     D 6.018344 2578.89 163
## 3  Fair     I 6.564457 3019.00 175
## 4  Fair     E 5.909063 3469.63 224
## 5  Fair     H 6.579373 5240.73 303
## 6  Fair     F 5.990513 4901.35 312
## 7  Fair     G 6.173822 5102.83 314
## 8  Good     J 6.377003 5139.33 307
## 9  Good     I 6.253544 8568.94 522
## 10 Good     D 5.620076 9770.35 662
## ..  ...   ...      ...     ... ...
## 
## > diamonds %>% group_by(cut, color) %>% summarise(mean = mean(x), 
## +     sum = sum(x, y, z), n = n()) %>% arrange(desc(n), cut, color)
## Source: local data frame [35 x 5]
## Groups: cut
## 
##     cut color     mean      sum   n
## 1  Fair     G 6.173822  5102.83 314
## 2  Fair     F 5.990513  4901.35 312
## 3  Fair     H 6.579373  5240.73 303
## 4  Fair     E 5.909063  3469.63 224
## 5  Fair     I 6.564457  3019.00 175
## 6  Fair     D 6.018344  2578.89 163
## 7  Fair     J 6.747311  2111.40 119
## 8  Good     E 5.617889 13758.40 933
## 9  Good     F 5.693443 13587.47 909
## 10 Good     G 5.850264 13379.44 871
## ..  ...   ...      ...      ... ...
## 
## > diamonds %>% group_by(cut, color, clarity) %>% summarise(mean_carat = mean(carat)) %>% 
## +     ggplot(aes(x = cut, y = mean_carat, color = color)) +  .... [TRUNCATED]
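
The lead(), lag(), and ntile() calls above are easier to follow on a handful of rows. The sketch below is not part of the workshop scripts. It reuses the first five z values shown in the output above and assumes the dplyr package is installed; the column name half is made up for illustration.

```r
library(dplyr)

# First five z values from the diamonds output above
df <- data.frame(z = c(2.43, 2.31, 2.31, 2.63, 2.75))

res <- df %>%
  mutate(
    lead_z = lead(z) - z,  # next row minus current row; the last row has no successor, so NA
    lag_z  = lag(z) - z,   # previous row minus current row; the first row has no predecessor, so NA
    half   = ntile(z, 2)   # rank the rows by z and split them into 2 roughly equal buckets
  )

print(res)
```

lag_z is lead_z shifted down one row with the sign flipped, which is why the two outputs above mirror each other.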

http://www.rstudio.com/resources/cheatsheets/
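
Two details of the grouped summaries above are worth a small experiment. First, sum(x, y, z) pools every value of all three columns into one total; it is not a row-wise sum. Second, the arrange(n) output above is sorted within each cut, which appears to reflect the older dplyr used for this handout; in recent dplyr releases arrange() ignores grouping unless .by_group = TRUE is given. The sketch below uses made-up data and assumes dplyr 1.0 or later (for the .groups argument).

```r
library(dplyr)

# Made-up data for illustration only
df <- data.frame(
  cut   = c("Fair", "Fair", "Fair", "Good", "Good", "Good"),
  color = c("D", "D", "E", "D", "E", "E"),
  x     = c(1, 2, 3, 4, 5, 6),
  y     = c(10, 20, 30, 40, 50, 60)
)

# sum(x, y) adds up every value of both columns into a single number
totals <- df %>% summarise(mean_x = mean(x), sum_xy = sum(x, y), n = n())
print(totals)

# Count rows per (cut, color); the result stays grouped by cut
counts <- df %>%
  group_by(cut, color) %>%
  summarise(n = n(), .groups = "drop_last")

# Sort by n within each cut, as in the handout output
within_cut <- counts %>% arrange(n, .by_group = TRUE)
print(within_cut)

# Drop the grouping first to sort across the whole table
overall <- counts %>% ungroup() %>% arrange(desc(n), cut, color)
print(overall)
```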

Types of RestfulReL Oracle Cloud Connections

source("../08 eval(parse vs. json/ParseEval vs JSON.R", echo = TRUE)

Joining Data

source("../09 Joining Data/Joining Data.R", echo = TRUE)
## 
## > require("jsonlite")
## 
## > require(dplyr)
## 
## > ddf <- data.frame(fromJSON(getURL(URLencode("129.152.144.84:5001/rest/native/?query=\"select * from DIAMONDS\""), 
## +     httpheader = c(DB = "jdbc:o ..." ... [TRUNCATED] 
## 
## > tbl_df(ddf)
## Source: local data frame [53,940 x 11]
## 
##    diamond_id carat       cut color clarity depth tbl price    x    y    z
## 1           1  0.23     Ideal     E     SI2  61.5  55  null 3.95 3.98 2.43
## 2           2  0.21   Premium     E     SI1  59.8  61   326 3.89 3.84 2.31
## 3           3  0.23      Good     E     VS1  56.9  65   327 4.05 4.07 2.31
## 4           4  0.29   Premium     I     VS2  62.4  58   334 4.20 4.23 2.63
## 5           5  0.31      Good     J     SI2  63.3  58   335 4.34 4.35 2.75
## 6           6  0.24 Very Good     J    VVS2  62.8  57   336 3.94 3.96 2.48
## 7           7  0.24 Very Good     I    VVS1  62.3  57   336 3.95 3.98 2.47
## 8           8  0.26 Very Good     H     SI1  61.9  55   337 4.07 4.11 2.53
## 9           9  0.22      Fair     E     VS2  65.1  61   337 3.87 3.78 2.49
## 10         10  0.23      null     H     VS1  59.4  61   338 4.00 4.05 2.39
## ..        ...   ...       ...   ...     ...   ... ...   ...  ...  ...  ...
## 
## > sdf <- data.frame(fromJSON(getURL(URLencode("129.152.144.84:5001/rest/native/?query=\"select * from diam_sale\""), 
## +     httpheader = c(DB = "jdbc: ..." ... [TRUNCATED] 
## 
## > tbl_df(sdf)
## Source: local data frame [64,839 x 4]
## 
##    SALE_ID RETAILER_ID DIAMOND_ID          SALES_DATE
## 1      918          16        769 2013-05-10 00:00:00
## 2      919          41        770 2010-12-15 00:00:00
## 3      920          40        771 2010-02-22 00:00:00
## 4      921          33        772 2009-11-05 00:00:00
## 5      922          33        772 2009-11-05 00:00:00
## 6      923          46        773 2011-11-07 00:00:00
## 7      924           9        774 2011-01-14 00:00:00
## 8      925          16        775 2011-12-23 00:00:00
## 9      926          16        775 2011-12-23 00:00:00
## 10     927          26        776 2011-12-25 00:00:00
## ..     ...         ...        ...                 ...
## 
## > rdf <- data.frame(fromJSON(getURL(URLencode("129.152.144.84:5001/rest/native/?query=\"select * from diam_retailer\""), 
## +     httpheader = c(DB = "j ..." ... [TRUNCATED] 
## 
## > tbl_df(rdf)
## Source: local data frame [50 x 2]
## 
##    RETAILER_ID                  NAME
## 1            0             ZALE CORP
## 2            1 STERLING JEWELERS INC
## 3            2   FRED MEYER JEWELERS
## 4            3     HELZBERG DIAMONDS
## 5            4      ULTRA STORES INC
## 6            5      SAMUELS JEWELERS
## 7            6            TIFFANY CO
## 8            7    ROGERS ENTERPRISES
## 9            8    BEN BRIDGE JEWELER
## 10           9           DON ROBERTO
## ..         ...                   ...
## 
## > names(ddf)
##  [1] "diamond_id" "carat"      "cut"        "color"      "clarity"   
##  [6] "depth"      "tbl"        "price"      "x"          "y"         
## [11] "z"         
## 
## > names(sdf)
## [1] "SALE_ID"     "RETAILER_ID" "DIAMOND_ID"  "SALES_DATE" 
## 
## > names(rdf)
## [1] "RETAILER_ID" "NAME"       
## 
## > colnames(ddf) <- toupper(names(ddf))
## 
## > dsdf <- inner_join(ddf, sdf, by = "DIAMOND_ID")
## 
## > inner_join(dsdf, rdf, by = "RETAILER_ID") %>% tbl_df
## Source: local data frame [64,839 x 15]
## 
##    DIAMOND_ID CARAT       CUT COLOR CLARITY DEPTH TBL PRICE    X    Y    Z
## 1           1  0.23     Ideal     E     SI2  61.5  55  null 3.95 3.98 2.43
## 2           1  0.23     Ideal     E     SI2  61.5  55  null 3.95 3.98 2.43
## 3           2  0.21   Premium     E     SI1  59.8  61   326 3.89 3.84 2.31
## 4           3  0.23      Good     E     VS1  56.9  65   327 4.05 4.07 2.31
## 5           4  0.29   Premium     I     VS2  62.4  58   334 4.20 4.23 2.63
## 6           5  0.31      Good     J     SI2  63.3  58   335 4.34 4.35 2.75
## 7           5  0.31      Good     J     SI2  63.3  58   335 4.34 4.35 2.75
## 8           6  0.24 Very Good     J    VVS2  62.8  57   336 3.94 3.96 2.48
## 9           7  0.24 Very Good     I    VVS1  62.3  57   336 3.95 3.98 2.47
## 10          8  0.26 Very Good     H     SI1  61.9  55   337 4.07 4.11 2.53
## ..        ...   ...       ...   ...     ...   ... ...   ...  ...  ...  ...
## Variables not shown: SALE_ID (int), RETAILER_ID (int), SALES_DATE (fctr),
##   NAME (fctr)
## 
## > colnames(ddf) <- toupper(names(ddf))
## 
## > inner_join(ddf, sdf, by = "DIAMOND_ID") %>% inner_join(., 
## +     rdf, by = "RETAILER_ID") %>% ggplot(aes(x = CARAT, y = NAME, 
## +     color = CUT)) + .... [TRUNCATED]

## 
## > joindf <- data.frame(fromJSON(getURL(URLencode("129.152.144.84:5001/rest/native/?query=\"select * from DIAMONDS d join diam_sale s on (d.\\\"diamond ..." ... [TRUNCATED] 
## 
## > joindf %>% ggplot(aes(x = carat, y = NAME, color = cut)) + 
## +     geom_point()

http://www.rstudio.com/resources/cheatsheets/
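
The joined result above has 64,839 rows, the same as diam_sale, rather than the 53,940 rows of DIAMONDS: a diamond with several sales appears once per sale (diamonds 1 and 5 each show up twice). The key columns must also match by name, which is why the script upper-cases the column names of ddf before joining. The sketch below reproduces both points with made-up miniature tables; it assumes dplyr is installed.

```r
library(dplyr)

# Made-up miniature versions of the three tables
ddf <- data.frame(diamond_id = 1:3, carat = c(0.23, 0.21, 0.23))
sdf <- data.frame(
  SALE_ID     = 1:4,
  RETAILER_ID = c(0L, 1L, 1L, 0L),
  DIAMOND_ID  = c(1L, 1L, 2L, 3L)   # diamond 1 was sold twice
)
rdf <- data.frame(RETAILER_ID = 0:1, NAME = c("ZALE CORP", "STERLING JEWELERS INC"))

# Key columns must share a name, so align the case first
colnames(ddf) <- toupper(names(ddf))

joined <- ddf %>%
  inner_join(sdf, by = "DIAMOND_ID") %>%   # one output row per matching sale
  inner_join(rdf, by = "RETAILER_ID")      # attach the retailer name

print(joined)
```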

List Indexing

source("../10 ListsForIfFunctionsPng/List Indexing.R", echo = TRUE)
## 
## > l <- list(a = 1:10, b = 11:20)
## 
## > l
## $a
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $b
##  [1] 11 12 13 14 15 16 17 18 19 20
## 
## 
## > l[1]
## $a
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## 
## > l[2]
## $b
##  [1] 11 12 13 14 15 16 17 18 19 20
## 
## 
## > l[[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## > l[[2]]
##  [1] 11 12 13 14 15 16 17 18 19 20
## 
## > l$a
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## > l$b
##  [1] 11 12 13 14 15 16 17 18 19 20
## 
## > ll <- list(1:10, 11:20)
## 
## > ll
## [[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## [[2]]
##  [1] 11 12 13 14 15 16 17 18 19 20
## 
## 
## > ll[1]
## [[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## 
## > ll[2]
## [[1]]
##  [1] 11 12 13 14 15 16 17 18 19 20
## 
## 
## > ll[[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## > ll[[2]]
##  [1] 11 12 13 14 15 16 17 18 19 20
## 
## > lll <- list()
## 
## > lll
## list()
## 
## > lll[["a"]] <- 111
## 
## > lll
## $a
## [1] 111
## 
## 
## > lll[1]
## $a
## [1] 111
## 
## 
## > lll[["a"]]
## [1] 111
## 
## > lll[[1]]
## [1] 111

For more details on [[…]], see http://stackoverflow.com/questions/1169456/in-r-what-is-the-difference-between-the-and-notations-for-accessing-the
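
As a compact summary of the session above, the following base R sketch checks what each indexing form returns. The variable names sub, elem, and key are made up for illustration.

```r
l <- list(a = 1:10, b = 11:20)

sub  <- l[1]    # single bracket: a list of length 1 that keeps the name "a"
elem <- l[[1]]  # double bracket: the element itself, an integer vector

class(sub)      # "list"
class(elem)     # "integer"

# $ is shorthand for [[ with a literal name
identical(l$a, l[["a"]])

# [[ also accepts a name stored in a variable, which $ cannot do
key <- "b"
l[[key]]

# Assigning through [[ creates the element if it does not exist yet
lll <- list()
lll[["a"]] <- 111
```

This is why the plotting scripts later in the workshop use l[[1]], l[[2]], and so on: print() needs the plot object itself, not a one-element list wrapping it.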

Lists, For and If Statements, Functions, and Generating PNG Files

source("../10 ListsForIfFunctionsPng/ListsForIfFunctionsPng.R", echo = TRUE)
## 
## > require("tidyr")
## 
## > require("dplyr")
## 
## > require("jsonlite")
## 
## > q = "Good"
## 
## > r <- data.frame(fromJSON(getURL(URLencode("129.152.144.84:5001/rest/native/?query=\"select * from diamonds where \\\"cut\\\" = \\'\"q\"\\'\""), 
## +   .... [TRUNCATED] 
## 
## > myplot <- function(df, x) {
## +     names(df) <- c("x", "n")
## +     ggplot(df, aes(x = x, y = n)) + geom_point()
## + }
## 
## > categoricals <- eval(parse(text = substring(getURL(URLencode("http://129.152.144.84:5001/rest/native/?query=\"select * from diamonds\""), 
## +     htt .... [TRUNCATED] 
## 
## > ddf <- data.frame(fromJSON(getURL(URLencode("129.152.144.84:5001/rest/native/?query=\"select * from DIAMONDS\""), 
## +     httpheader = c(DB = "jdbc:o ..." ... [TRUNCATED] 
## 
## > l <- list()
## 
## > for (i in names(ddf)) {
## +     if (i %in% categoricals[[1]]) {
## +         r <- data.frame(fromJSON(getURL(URLencode("129.152.144.84:5001/rest/native/? ..." ... [TRUNCATED]

## 
## > png("/Users/pcannata/Mine/UT/GitRepositories/DataVisualization/RWorkshop/00 Doc/categoricals.png", 
## +     width = 25, height = 10, units = "in", res .... [TRUNCATED] 
## 
## > grid.newpage()

## 
## > pushViewport(viewport(layout = grid.layout(1, 12)))
## 
## > print(l[[1]], vp = viewport(layout.pos.row = 1, layout.pos.col = 1:4))
## 
## > print(l[[2]], vp = viewport(layout.pos.row = 1, layout.pos.col = 5:8))
## 
## > print(l[[3]], vp = viewport(layout.pos.row = 1, layout.pos.col = 9:12))
## 
## > dev.off()
## quartz_off_screen 
##                 2 
## 
## > myplot1 <- function(df, x) {
## +     names(df) <- c("x")
## +     ggplot(df, aes(x = x)) + geom_histogram()
## + }
## 
## > l <- list()
## 
## > for (i in names(ddf)) {
## +     if (i %in% categoricals[[2]]) {
## +         r <- data.frame(fromJSON(getURL(URLencode("129.152.144.84:5001/rest/native/? ..." ... [TRUNCATED]
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## 
## > png("/Users/pcannata/Mine/UT/GitRepositories/DataVisualization/RWorkshop/00 Doc/categoricals2.png", 
## +     width = 25, height = 20, units = "in", re .... [TRUNCATED] 
## 
## > grid.newpage()
## 
## > pushViewport(viewport(layout = grid.layout(2, 12)))
## 
## > print(l[[1]], vp = viewport(layout.pos.row = 1, layout.pos.col = 1:3))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## 
## > print(l[[2]], vp = viewport(layout.pos.row = 1, layout.pos.col = 4:6))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## 
## > print(l[[3]], vp = viewport(layout.pos.row = 1, layout.pos.col = 7:9))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## 
## > print(l[[4]], vp = viewport(layout.pos.row = 1, layout.pos.col = 10:12))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## 
## > print(l[[5]], vp = viewport(layout.pos.row = 2, layout.pos.col = 1:3))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## 
## > print(l[[6]], vp = viewport(layout.pos.row = 2, layout.pos.col = 4:6))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## 
## > print(l[[7]], vp = viewport(layout.pos.row = 2, layout.pos.col = 7:9))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## 
## > print(l[[8]], vp = viewport(layout.pos.row = 2, layout.pos.col = 10:12))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

## 
## > dev.off()
## quartz_off_screen 
##                 2
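
Because the echoed script is truncated in several places, here is a self-contained sketch of the same pattern: loop over column names, use if to keep the categorical ones, store one plot per column in a list, and print the list into a grid layout on a png device. It is not the workshop script. It uses the diamonds data bundled with ggplot2 instead of the REST server, is.factor() instead of the categoricals list, and a temporary file instead of the author's path; count_plot and out_file are hypothetical names.

```r
library(ggplot2)
library(grid)

# Helper in the spirit of myplot(): point plot of the counts for one column
count_plot <- function(df, column) {
  counts <- as.data.frame(table(df[[column]]))
  names(counts) <- c("x", "n")
  ggplot(counts, aes(x = x, y = n)) + geom_point() + ggtitle(column)
}

# Collect one plot per categorical (factor) column
plots <- list()
for (column in names(diamonds)) {
  if (is.factor(diamonds[[column]])) {
    plots[[column]] <- count_plot(diamonds, column)
  }
}

# Write all plots side by side into one PNG; the temp path is a placeholder
out_file <- tempfile(fileext = ".png")
png(out_file, width = 25, height = 10, units = "in", res = 72)
grid.newpage()
pushViewport(viewport(layout = grid.layout(1, length(plots))))
for (k in seq_along(plots)) {
  print(plots[[k]], vp = viewport(layout.pos.row = 1, layout.pos.col = k))
}
dev.off()
```

Sizing the layout with length(plots) avoids hard-coding the column ranges (1:4, 5:8, 9:12) when the number of categorical columns changes.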

Beautiful Plotting

Based upon http://www.cs.utexas.edu/~cannata/dataVis/Class%20Notes/Beautiful%20plotting%20in%20R_%20A%20ggplot2%20cheatsheet%20_%20Technical%20Tidbits%20From%20Spatial%20Analysis%20&%20Data%20Science.pdf

source("../11 Beautiful Plotting/Beautiful Plotting.R", echo = TRUE)
## 
## > require("tidyr")
## 
## > require("dplyr")
## 
## > require("jsonlite")
## 
## > i = "cut"
## 
## > r <- data.frame(fromJSON(getURL(URLencode("129.152.144.84:5001/rest/native/?query=\"select \\\"\"i\"\\\", count(*) n from DIAMONDS group by \\\"\"i\ .... [TRUNCATED] 
## 
## > names(r) <- c("x", "n")
## 
## > g <- ggplot(r, aes(x = x, y = n)) + geom_point(color = "firebrick")
## 
## > g

## 
## > g <- g + ggtitle("Cut")
## 
## > g

## 
## > g + labs(title = "Cut")

## 
## > g <- g + theme(plot.title = element_text(size = 20, 
## +     face = "bold", vjust = 2))
## 
## > g
## 
## > require(extrafont)
## Loading required package: extrafont
## Registering fonts with R

## 
## > g <- g + theme(plot.title = element_text(size = 30, 
## +     face = "bold", vjust = 1, family = "Bauhaus93"))
## 
## > g

## 
## > g <- g + labs(title = "This is a longer\ntitle than expected")
## 
## > g

## 
## > g <- g + theme(plot.title = element_text(size = 30, 
## +     face = "bold", vjust = 1, lineheight = 1))
## 
## > g

## 
## > g <- g + labs(x = "Cut", y = paste("Cut", "Numbers"))
## 
## > g

## 
## > g + theme(axis.ticks.y = element_blank(), axis.text.y = element_blank())

## 
## > g <- g + theme(axis.text.x = element_text(angle = 50, 
## +     size = 20, vjust = 0.5))
## 
## > g

## 
## > g <- g + theme(axis.title.x = element_text(color = "forestgreen", 
## +     vjust = 0.35), axis.title.y = element_text(color = "cadetblue", 
## +     vjus .... [TRUNCATED] 
## 
## > g

## 
## > g + ylim(c(0, 10000))
## Warning: Removed 3 rows containing missing values (geom_point).

## 
## > g + scale_y_continuous(limits = c(0, 10000))
## Warning: Removed 3 rows containing missing values (geom_point).

## 
## > g + scale_y_continuous(label = function(x) {
## +     return(paste("My cut number is ", x))
## + }, limits = c(0, 10000))
## Warning: Removed 3 rows containing missing values (geom_point).

## 
## > r <- data.frame(fromJSON(getURL(URLencode("129.152.144.84:5001/rest/native/?query=\"select \\\"clarity\\\", \\\"\"x\"\\\", count(*) n from DIAMONDS  .... [TRUNCATED] 
## 
## > names(r) <- c("legend", "x", "n")
## 
## > g <- ggplot(r, aes(x = x, y = n, size = legend, color = legend)) + 
## +     geom_point()
## 
## > g

## 
## > g <- g + labs(title = "Cut") + labs(x = "Cut", y = paste("Cut", 
## +     "Numbers")) + theme(axis.title.x = element_text(color = "forestgreen", 
## +     .... [TRUNCATED] 
## 
## > g

## 
## > g + theme(legend.title = element_blank())

## 
## > g + theme(legend.title = element_text(colour = "chocolate", 
## +     size = 16, face = "bold"))

## 
## > g + scale_color_discrete(name = "The types of\nclarity are:")

## 
## > s <- g + theme(legend.key = element_rect(fill = "blue"))
## 
## > s

## 
## > s <- g + theme(legend.key = element_rect(fill = NA))
## 
## > s

## 
## > g + guides(colour = guide_legend(override.aes = list(size = 6)), 
## +     show_guide = FALSE)

## 
## > s <- g + geom_histogram(aes(x = x))
## 
## > s

## 
## > s <- g + geom_histogram(aes(x = x), show_guide = FALSE)
## 
## > s

## 
## > g <- ggplot(r, aes(x = x, y = n)) + geom_line(color = "grey") + 
## +     geom_point(color = "red")
## 
## > g

## 
## > g <- g + geom_line(aes(color = "Important line")) + 
## +     geom_point(aes(color = "My points"))
## 
## > g

## 
## > g <- g + scale_colour_manual(name = "", values = c(`Important line` = "red", 
## +     `My points` = "blue"))
## 
## > g

## 
## > g <- ggplot(r, aes(x = x, y = n, size = legend, color = legend)) + 
## +     geom_point()
## 
## > g

## 
## > g <- g + theme(panel.background = element_rect(fill = "grey75"))
## 
## > g

## 
## > g <- g + theme(panel.grid.major = element_line(colour = "orange", 
## +     size = 2), panel.grid.minor = element_line(colour = "blue"))
## 
## > g

## 
## > g <- g + theme(plot.background = element_rect(fill = "blue"))
## 
## > g

## 
## > g <- g + theme(plot.margin = unit(c(1, 6, 1, 6), "cm"))
## 
## > g

## 
## > g <- ggplot(r, aes(x = x, y = n, size = legend, color = legend)) + 
## +     geom_point() + facet_wrap(~legend, nrow = 1)
## 
## > g

## 
## > ggplot(r, aes(x = x, y = n, size = legend, color = legend)) + 
## +     geom_point() + facet_wrap(~legend, nrow = 1, scale = "free")

## 
## > p1 <- ggplot(r, aes(x = legend, y = n, size = x, color = x)) + 
## +     geom_point()
## 
## > p2 <- ggplot(r, aes(x = x, y = n, size = legend, color = legend)) + 
## +     geom_point()
## 
## > require(grid)
## 
## > pushViewport(viewport(layout = grid.layout(1, 2)))
## 
## > print(p1, vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
## 
## > print(p2, vp = viewport(layout.pos.row = 1, layout.pos.col = 2))
## 
## > g <- ggplot(r, aes(x = x, y = n, size = legend, color = legend)) + 
## +     geom_point() + facet_wrap(~legend, nrow = 1)
## 
## > g
## 
## > require(ggthemes)
## Loading required package: ggthemes

## 
## > g + theme_economist() + scale_colour_economist()

## 
## > ggplot(r, aes(x = x, y = n, size = legend, color = legend)) + 
## +     geom_point() + scale_color_manual(values = c("dodgerblue4", 
## +     "darkolivegr ..." ... [TRUNCATED]

## 
## > ggplot(r, aes(x = x, y = n, size = legend, color = legend)) + 
## +     geom_point() + scale_color_brewer(palette = "Set1")

## 
## > ggplot(r, aes(x = x, y = n, size = legend, color = legend)) + 
## +     geom_point() + scale_colour_tableau()

## 
## > ggplot(r, aes(x = x, y = n, color = n)) + geom_point() + 
## +     scale_color_gradient(low = "darkkhaki", high = "darkgreen")
## 
## > mid <- mean(r$n)
## 
## > ggplot(r, aes(x = x, y = n, color = n)) + geom_point() + 
## +     scale_color_gradient2(midpoint = mid, low = "green", mid = "yellow", 
## +         high .... [TRUNCATED]
## Warning: Non Lab interpolation is deprecated

## 
## > require(grid)
## 
## > my_grob = grobTree(textGrob("This text stays in place!", 
## +     x = 0.1, y = 0.95, hjust = 0, gp = gpar(col = "blue", fontsize = 15, 
## +         font .... [TRUNCATED] 
## 
## > g <- ggplot(r, aes(x = x, y = n, color = n)) + geom_point() + 
## +     scale_color_gradient2(midpoint = mid, low = "green", mid = "yellow", 
## +         .... [TRUNCATED]
## Warning: Non Lab interpolation is deprecated

## 
## > g

## 
## > g + facet_wrap(~legend)
## 
## > g <- ggplot(r, aes(x = x, y = n, color = n)) + geom_point() + 
## +     scale_color_gradient2(midpoint = mid, low = "green", mid = "yellow", 
## +         .... [TRUNCATED]
## Warning: Non Lab interpolation is deprecated

## 
## > g

## 
## > g + coord_flip()

## 
## > ggplot(r, aes(x = x, y = n)) + geom_boxplot(fill = "darkseagreen4")

## 
## > p1 <- ggplot(r, aes(x = x, y = n)) + geom_point()
## 
## > p2 <- ggplot(r, aes(x = x, y = n)) + geom_jitter(alpha = 1, 
## +     aes(color = legend), position = position_jitter(width = 0.3))
## 
## > require(grid)
## 
## > pushViewport(viewport(layout = grid.layout(1, 3)))
## 
## > print(p1, vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
## 
## > print(p2, vp = viewport(layout.pos.row = 1, layout.pos.col = 2:3))
## 
## > g <- ggplot(r, aes(x = x, y = n)) + geom_jitter(alpha = 1, 
## +     aes(color = legend), position = position_jitter(width = 0.1)) + 
## +     geom_violin .... [TRUNCATED] 
## 
## > g

## 
## > g + coord_flip()
## 
## > joindf <- data.frame(fromJSON(getURL(URLencode("129.152.144.84:5001/rest/native/?query=\"select * from DIAMONDS d join diam_sale s on (d.\\\"diamond ..." ... [TRUNCATED] 
## 
## > require(lubridate)
## Loading required package: lubridate
## 
## Attaching package: 'lubridate'
## 
## The following object is masked from 'package:plyr':
## 
##     here

## 
## > ggplot(joindf, aes(x = day(SALES_DATE), y = tbl)) + 
## +     geom_point() + stat_smooth()
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
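
The "Removed 3 rows containing missing values" warnings earlier in this section come from how ylim() and scale_y_continuous(limits = ...) work: they set scale limits, and by default values outside the limits of a continuous scale are replaced with NA before drawing. coord_cartesian() zooms the view instead and keeps every point in the plot data. The sketch below illustrates the difference with the diamonds data bundled with ggplot2, used here as a stand-in for the table on the REST server; g_limited and g_zoomed are made-up names.

```r
library(ggplot2)

# Counts per cut from ggplot2's bundled diamonds data
r <- as.data.frame(table(diamonds$cut))
names(r) <- c("x", "n")

g <- ggplot(r, aes(x = x, y = n)) + geom_point(color = "firebrick")

# Scale limits: points with n above 10000 become missing values and are dropped
g_limited <- g + ylim(c(0, 10000))

# Coordinate limits: the view is zoomed, but no data is discarded
g_zoomed <- g + coord_cartesian(ylim = c(0, 10000))

# Number of cuts whose count lies above the limit
sum(r$n > 10000)
```

The distinction matters more for layers that compute statistics, such as stat_smooth() or geom_boxplot(), because scale limits remove data before the statistic is calculated.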

Bokeh

Based upon http://hafen.github.io/rbokeh/

# source("../12 Bokeh/Bokeh.R", echo = TRUE)